Abstract

The convergence of mobile computing and cloud computing enables new multimedia applications that are both resource-intensive
and interaction-intensive. For these applications, end-to-end network bandwidth and latency matter greatly when
cloud  resources  are  used  to  augment  the  computational  power  and  battery  life  of  a  mobile  device.  We  first  present  quantitative 
evidence  that  this  crucial  design  consideration  to  meet  interactive  performance  criteria  limits  data  center  consolidation.  We 
then  describe  an  architectural  solution  that  is  a  seamless  extension  of  today’s  cloud  computing  infrastructure. 
1  Introduction


The convergence of cloud computing and mobile computing has begun. Apple’s Siri for the iPhone [1], which
performs  compute-intensive  speech  recognition  in  the  cloud,  hints  at  the  rich  commercial  opportunities  in  this 
emerging  space.  Rapid  improvements  in  sensing,  display  quality,  connectivity,  and  compute  power  of  mobile 
devices  will  lead  to  new  cloud-enabled  mobile  applications  that  embody  voice-,  image-,  motion-  and  location- 
based  interactivity.  Siri  is  just  the  leading  edge  of  this  disruptive  force. 

Many  of  these  new  applications  will  be  interactive  as  well  as  resource-intensive,  pushing  well  beyond  the 
processing,  storage,  and  energy  limits  of  mobile  devices.  When  their  use  of  cloud  resources  is  in  the  critical  path 
of  user  interaction,  end-to-end  operation  latencies  can  be  no  more  than  a  few  tens  of  milliseconds.  Violating  this 
bound  results  in  distraction  and  annoyance  to  a  mobile  user  who  is  already  attention-challenged.  Such  fine-grained 
cloud  usage  is  different  from  the  coarse-grained  usage  models  and  SLA  guarantees  of  cloud  computing  today. 

The central contribution of this paper is the experimental evidence that these new applications force a fundamental
change in cloud computing architecture. We describe five example applications of this genre in Section 2,
and  experimentally  demonstrate  in  Section  3  that  even  with  the  rapid  improvements  predicted  for  mobile  computing 
hardware,  such  applications  will  benefit  from  cloud  resources.  The  remainder  of  the  paper  explores  the  architectural 
implications  of  this  class  of  applications.  In  the  past,  centralization  was  the  dominant  theme  of  cloud  computing. 
This  is  reflected  in  the  consolidation  of  dispersed  compute  capacity  into  a  few  large  data  centers.  For  example, 
Amazon Web Services spans the entire planet with just a handful of data centers located in Oregon, N. California,
Virginia, Ireland, Singapore, Tokyo, and São Paulo. The underlying value proposition of cloud computing is
that  centralization  exploits  economies  of  scale  to  lower  the  marginal  cost  of  system  administration  and  operations. 
These  economies  of  scale  evaporate  if  too  many  data  centers  have  to  be  maintained  and  administered. 

Aggressive  global  consolidation  of  data  centers  implies  large  average  separation  between  a  mobile  device  and 
its  cloud.  End-to-end  communication  then  involves  many  network  hops  and  results  in  high  latencies,  as  quantified 
in  Section  4  using  measurements  from  Amazon  EC2.  Under  these  conditions,  achieving  crisp  interactive  response 
for  latency-sensitive  mobile  applications  will  be  a  challenge.  Limiting  consolidation  and  locating  small  data  centers 
much  closer  to  mobile  devices  would  solve  this  problem,  but  it  would  sacrifice  the  key  benefit  of  cloud  computing. 

How do we achieve the right balance? Can we support latency-sensitive and resource-intensive mobile applications
without sacrificing the consolidation benefits of cloud computing? Section 5 shows how a two-level
architecture  can  reconcile  this  conflict.  The  first  level  of  this  hierarchy  is  today’s  unmodified  cloud  infrastructure. 
The  second  level  is  new.  It  consists  of  dispersed  but  unmanaged  infrastructure  with  no  hard  state.  Each  second- 
level  element  is  effectively  a  “second-class  data  center”  with  soft  state  generated  locally  or  cached  on  demand  from 
the  first  level.  Data  center  proximity  to  mobile  devices  is  thus  achieved  by  the  second  level  without  limiting  the 
consolidation  achievable  at  the  first  level.  Communication  between  first  and  second  levels  is  outside  the  critical 
path  of  interactive  mobile  applications.  This  hierarchical  structure  also  has  an  additional  benefit.  As  discussed  in 
Section  6,  it  improves  availability  when  cloud  connectivity  is  fragile  and  prone  to  disruption. 
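The soft-state property of the second level can be illustrated with a minimal sketch. The class name, the LRU eviction policy, the capacity, and the dictionary standing in for the first level are all assumptions made for illustration; the text above only requires that second-level state be cached on demand and be safe to discard.

```python
from collections import OrderedDict


class SecondLevelNode:
    """Hypothetical second-level element that holds only soft (discardable) state."""

    def __init__(self, fetch_from_first_level, capacity=2):
        self._fetch = fetch_from_first_level  # callable: key -> value (slow path)
        self._capacity = capacity
        self._cache = OrderedDict()           # least recently used entry first

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)      # mark as recently used
            return self._cache[key]
        value = self._fetch(key)              # cache miss: go to the first level
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            # Eviction is safe: the authoritative copy lives at the first level.
            self._cache.popitem(last=False)
        return value

    def reset(self):
        """Discard all soft state; it is re-fetched on demand."""
        self._cache.clear()


# Stand-in for the first level (hard state); keys and values are made up.
first_level = {"face-db": "v1", "speech-model": "v7", "landmarks": "v3"}
fetches = []


def fetch(key):
    fetches.append(key)
    return first_level[key]


node = SecondLevelNode(fetch)
node.get("face-db")   # miss: fetched from the first level
node.get("face-db")   # hit: served locally, no first-level traffic
node.reset()          # e.g., the node is rebooted or replaced
node.get("face-db")   # miss again, but nothing was lost
```

Only cache misses involve the first level, which is why that communication can stay off the interactive critical path once the cache is warm.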

Throughout  this  paper,  the  term  “cloud  computing”  refers  to  transient  use  of  computational  cloud  resources  by 
mobile  clients.  Other  forms  of  cloud  usage  such  as  processing  of  large  datasets  (data-intensive  computing)  and 
asynchronous  long-running  computations  (agent-based  computing)  are  outside  the  scope  of  this  paper. 

2  Mobile  Multimedia  Applications 

Beyond today’s familiar desktop, laptop, and smartphone applications is a new genre of software to seamlessly
augment human perception and cognition. Consider Watson, IBM’s question-answering technology that publicly
demonstrated its prowess in 2011 [2]. Imagine such a tool being available anywhere and anytime to rapidly respond
to  urgent  questions  posed  by  an  attention-challenged  mobile  user.  Such  a  vision  may  be  within  reach  in  the  next 
decade.  Free-form  speech  recognition,  natural  language  translation,  face  recognition,  object  recognition,  dynamic 
action interpretation from video, and body language interpretation are other examples of this genre of futuristic
applications. Although a full-fledged cognitive assistance system is out of reach today, we investigate several
smaller applications that are building blocks towards this vision. Five such applications are described below.

2.1  Face  Recognition  (Face) 

A  most  basic  and  fundamental  perception  task  is  the  recognition  of  human  faces.  The  problem  has  been  long  studied 
in  the  computer  vision  community,  and  fast  algorithms  for  detecting  human  faces  in  images  have  been  available  for 
some time [3]. Identification of individuals through computer vision is still an area of active research, spurred by
applications  in  security  and  surveillance  tasks.  However,  such  technology  is  also  very  useful  in  mobile  devices  for 
personal  information  management  and  cognitive  assistance.  For  example,  an  application  that  can  recognize  a  face 
and  remind  you  who  it  is  (by  name,  contact  information,  or  context  in  which  you  last  met)  can  be  quite  useful  to 
everyone,  and  invaluable  to  those  with  cognitive  or  visual  impairments.  Such  an  application  is  most  useful  if  it  can 
be  used  anywhere,  and  can  quickly  provide  a  response  to  avoid  potentially  awkward  social  situations. 

The  face  recognition  application  studied  here  detects  faces  in  an  image,  and  attempts  to  identify  the  face  from 
a  prepopulated  database.  The  application  uses  a  Haar  Cascade  of  classifiers  to  do  the  detection,  and  then  uses  the 
Eigenfaces method [4] based on principal component analysis (PCA) to make an identification. The implementation
is based on OpenCV [5] image processing and computer vision routines, and runs on a Microsoft Windows
environment. Training the classifiers and populating the database are done offline, so our experiments only consider
the  execution  time  of  the  recognition  task  on  a  pre-trained  system. 

2.2  Speech  Recognition  (Speech) 

Speech  as  a  modality  of  interaction  between  human  users  and  computers  is  a  long  studied  area  of  research.  Most 
success  has  been  in  very  specific  domains  or  in  applications  requiring  a  very  limited  vocabulary,  such  as  interactive 
voice response in phone answering services, and hands-free, in-vehicle control of cell phones. Several recent commercial
efforts aim for general purpose information query, device control, and language translation using speech
input on mobile devices [1, 6, 7].

The speech recognition application considered here uses an open-source speech-to-text framework built
on Hidden Markov Model (HMM) recognition systems [8]. It takes as input digitized audio of a spoken English
sentence,  and  attempts  to  extract  all  of  the  words  in  plain  text  format.  This  application  is  single-threaded.  Since  it 
is  written  in  Java,  it  can  run  on  both  Linux  and  Microsoft  Windows.  For  this  paper,  we  ran  it  on  Linux. 

2.3  Object  and  Pose  Identification  (Object) 

A third application is based on a computer vision algorithm originally developed for robotics [9], but modified for
use  by  handicapped  users.  The  computer  vision  system  identifies  known  objects,  and  importantly,  also  recognizes 
the  position  and  orientation  of  the  objects  relative  to  the  user.  This  information  is  then  used  to  guide  the  user  in 
manipulating  a  particular  object. 

Here,  the  application  identifies  and  locates  known  objects  in  a  scene.  The  implementation  runs  on  Linux,  and 
makes use of multiple cores. The system extracts key visual elements (SIFT features [10]) from an image, matches
these against a database of features from a known set of objects, and finally performs geometric computations
to determine the pose of the identified object. For the experiments in this paper, the database is populated with
thousands  of  features  extracted  from  more  than  500  images  of  13  different  objects. 
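The matching step can be sketched as a brute-force nearest-neighbor search followed by a vote. This is an illustration only, not the implementation of [9]: the three-dimensional descriptors, the object labels, and the voting rule are made up, real SIFT descriptors have 128 dimensions, and the geometric pose computation is omitted.

```python
import math
from collections import Counter


def nearest_label(descriptor, database):
    """Return the object label of the database descriptor closest to `descriptor`."""
    best_label, best_dist = None, math.inf
    for label, reference in database:
        dist = math.dist(descriptor, reference)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def identify_object(scene_descriptors, database):
    """Each scene descriptor votes for the object owning its nearest neighbor."""
    votes = Counter(nearest_label(d, database) for d in scene_descriptors)
    return votes.most_common(1)[0][0]


# Toy database of (label, descriptor) pairs; all values are hypothetical.
database = [
    ("mug", (0.0, 0.0, 1.0)),
    ("mug", (0.1, 0.0, 0.9)),
    ("book", (1.0, 1.0, 0.0)),
    ("book", (0.9, 1.0, 0.1)),
]
scene = [(0.05, 0.0, 0.95), (0.0, 0.1, 1.0), (0.8, 0.9, 0.2)]
print(identify_object(scene, database))
```

The cost of this search grows with both the number of scene features and the size of the database, which is consistent with the heavy processing load reported for Object.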

2.4  Mobile  Augmented  Reality  (AugReal) 

The  defining  property  of  a  mobile  augmented  reality  application  is  the  display  of  timely  and  relevant  information 
as an overlay on top of a live view of some scene. For example, it may show street names, restaurant ratings,
or directional arrows overlaid on the scene captured through a smartphone’s camera. Special mobile devices that
incorporate cameras and see-through displays in a wearable form factor [11] can be used instead of a smartphone.

Figure 1: Evolution of Hardware Performance (adapted from Flinn [15])

AugReal  uses  computer  vision  to  identify  actual  buildings  and  landmarks  in  a  scene,  and  label  them  precisely 
in  the  view  [12].  This  is  akin  to  an  image-based  query  in  Google  Goggles  [13],  but  running  continuously  on  a 
live  video  stream.  AugReal  extracts  a  set  of  features  from  the  scene  image,  and  uses  the  feature  descriptors  to 
find  similar-looking  entries  in  a  database  constructed  using  features  from  labeled  images  of  known  landmarks  and 
buildings.  The  database  search  is  kept  tractable  by  spatially  indexing  the  data  by  geographic  locations,  and  limiting 
search  to  a  slice  of  the  database  relevant  to  the  current  GPS  coordinates.  The  prototype  uses  1005  labeled  images 
of  200  buildings  as  the  relevant  database  slice.  AugReal  runs  on  Microsoft  Windows,  and  makes  significant  use 
of  OpenCV  libraries  [5],  Intel  Performance  Primitives  (IPP)  libraries,  and  multiple  processing  threads. 
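Restricting the search to a database slice near the current GPS fix can be sketched with a uniform grid. The cell size, class name, landmark names, and coordinates are hypothetical; the prototype's actual spatial index is not described in this section.

```python
import math
from collections import defaultdict


class LandmarkIndex:
    """Hypothetical grid index: landmarks are bucketed by latitude/longitude cell."""

    def __init__(self, cell_deg=0.01):
        self.cell_deg = cell_deg
        self.cells = defaultdict(list)

    def _cell(self, lat, lon):
        return (math.floor(lat / self.cell_deg), math.floor(lon / self.cell_deg))

    def add(self, name, lat, lon):
        self.cells[self._cell(lat, lon)].append(name)

    def slice_near(self, lat, lon):
        """Return landmarks in the cell containing the GPS fix and its 8 neighbors."""
        row, col = self._cell(lat, lon)
        found = []
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                found.extend(self.cells.get((row + d_row, col + d_col), []))
        return found


index = LandmarkIndex()
index.add("library", 40.4435, -79.9435)   # made-up coordinates
index.add("stadium", 40.4465, -80.0155)   # several cells to the west
print(index.slice_near(40.4441, -79.9428))
```

Feature matching then runs only against the images attached to the returned landmarks, rather than against the whole database.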

2.5  Physical  Simulation  and  Rendering  (Fluid) 

Our final application is used in computer graphics. Using accelerometer readings from a mobile device, it physically
models the motion of imaginary fluids with which the user can interact. For example, it can show liquid
sloshing around in a container depicted on a smartphone screen, such as a glass of water carried by the user as he
walks or runs. The application backend runs a physics simulation, based on the predictive-corrective incompressible
smoothed particle hydrodynamics (PCISPH) method [14]. We note that the computational structure of this
application  is  representative  of  many  other  applications,  particularly  “real-time”  (i.e.,  not  turn-based)  games. 

Fluid is implemented as a multithreaded Linux application. To ensure a good interactive experience, the
delay between user input and output state change has to be very low, on the order of 100 ms. In our experiments,
Fluid simulates a 2218 particle system with 20 ms timesteps, generating up to 50 frames per second.

3  Why  Cloud  Resources  are  Necessary 

3.1  Mobile  Hardware  Performance 

Handheld or body-worn mobile devices are always resource-poor relative to server hardware of comparable vintage
[16]. Figure 1, adapted from Flinn [15], illustrates the consistent large gap in the processing power of typical
server and mobile device hardware over a 15-year period. This stubborn gap reflects a fundamental reality of user
preferences: Moore’s Law has to be leveraged differently on hardware that people carry or wear for extended periods
of time. This is not just a temporary limitation of current mobile hardware technology, but is intrinsic to
mobility. The most sought-after features of a mobile device include light weight, small size, long battery life, comfortable
ergonomics, and tolerable heat dissipation. Processor speed, memory size, and disk capacity are secondary.

Figure 2: Dell Netbook Device Used in Experiments

Figure 3: Average response time of applications on mobile device under different conditions (see Sect. 3.2)

Our  experiments  use  a  Dell  Latitude  2102  as  the  mobile  device.  This  small  netbook  machine  is  more  powerful 
than  a  typical  smartphone  today  (Figure  2),  but  it  is  representative  of  mobile  devices  in  the  near  future. 

3.2  Extremes  of  Resource  Demands 

At first glance, it may appear that today’s smartphones are already powerful enough to support mobile multimedia
applications without leveraging cloud resources. Some digital cameras and smartphones support built-in face detection.
Android 4.0 APIs support tracking of multiple faces and give detailed information about the location of
eyes and mouth [17]. Google’s “Voice Actions for Android” performs voice recognition to allow hands-free control
of a smartphone [18]. Lowe [19] describes many computer vision applications that run on mobile devices today.

However,  upon  closer  examination,  the  situation  is  much  more  complex  and  subtle.  Consider  computer  vision, 
for  example.  Its  computational  requirements  vary  drastically  depending  on  the  operational  conditions.  For  example, 
it is possible to develop (near) frame-rate object recognition (including face recognition [20]) operating on mobile
computers if we assume restricted operational conditions such as a small number of models (e.g., small number
of identities for person recognition), and limited variability in observation conditions (e.g., frontal faces only).
The  computational  demands  greatly  increase  with  the  generality  of  the  problem  formulation.  For  example,  just  two 
simple  changes  make  a  huge  difference:  increasing  the  number  of  possible  faces  from  just  a  few  close  acquaintances 
to  the  entire  set  of  people  known  to  have  entered  a  building,  and  reducing  the  constraints  on  the  observation 
conditions  by  allowing  faces  to  be  at  arbitrary  viewpoints  from  the  observer. 

To  illustrate  the  great  variability  of  execution  times  possible  with  perception  applications,  we  perform  a  set 
of  experiments  using  two  of  the  applications  discussed  earlier.  We  run  the  Speech  and  Face  applications  on 
the  mobile  platform,  and  measure  the  response  times  for  a  wide  variety  of  inputs.  Figure  3  shows  the  results. 
For  the  speech  application,  execution  times  generally  increase  with  the  number  of  words  the  algorithm  recognizes 
(correctly  or  otherwise)  in  an  utterance.  Conditions  1,  2,  3  for  this  application  correspond  to  sentences  in  which 
no  words,  1-5  words,  and  6-22  words  are  recognized,  respectively.  The  response  time  varies  quite  dramatically, 
by  almost  2  orders  of  magnitude,  and  is  acceptable  only  when  the  application  fails  to  recognize  any  words.  When 
short  phrases  are  correctly  recognized,  the  response  time  is  marginal,  at  just  over  1  second,  on  average.  For  longer 
sentences, when the application works at all, it just takes too long. For comparison, Agus et al. [21] report that
human  subjects  recognize  short  target  phrases  within  300  to  450  ms,  and  are  able  to  tell  that  a  sound  is  a  human 
voice  within  a  mere  4  ms. 
Figure 4: Response times with and without cloud resources.


In  the  case  of  the  face  recognition  application,  the  best  response  times  occur  when  there  is  a  single,  large, 
recognizable  face  in  the  image.  These  correspond  to  Condition  1  in  Figure  3.  It  fares  the  worst  when  it  searches  in 
vain  at  smaller  and  smaller  scales  for  a  face  in  an  image  without  any  faces  (Condition  2).  Unfortunately,  response 
time  is  close  to  the  latter  for  images  that  only  contain  small  faces.  At  close  to  4-second  average  response  time 
in  these  conditions,  this  application  is  unacceptably  slow.  For  comparison,  recent  experimental  results  on  human 
subjects  by  Ramon  et  al.  [22]  show  that  recognition  times  under  controlled  conditions  range  from  370  milliseconds 
for the fastest responses on familiar faces to 620 milliseconds for the slowest response on an unfamiliar face. Lewis
et  al.  [23]  report  that  human  subjects  take  less  than  700  milliseconds  to  determine  the  absence  of  faces  in  a  scene, 
even  under  hostile  conditions  such  as  low  lighting  and  deliberately  distorted  optics. 

Such  data-dependent  and  context-dependent  tradeoffs  apply  across  the  board  to  virtually  all  applications  of 
this  genre.  In  continuous  use  under  the  widest  possible  range  of  operating  conditions,  providing  near  real-time 
responses,  and  tuned  for  very  low  error  rates,  these  applications  have  ravenous  appetites  for  processing,  memory 
and  energy  resources.  They  can  easily  overwhelm  a  mobile  device. 

3.3  Improvement  from  Cloud  Computing 

Performance improves considerably when cloud resources are leveraged. Figure 4 shows the median and 99th percentile
response times for the Speech and Face experiments of Figure 3 with and without use of cloud resources.
For  the  speech  case,  we  leverage  an  Amazon  EC2  instance.  For  the  face  recognition  application,  we  use  a  private 
cloud.  Although  variability  in  execution  times  still  exists,  the  absolute  response  times  are  significantly  improved. 
These  experiments  confirm  that  cloud  resources  can  improve  user  experience  for  mobile  multimedia  applications. 
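The two summary statistics used in Figure 4 can be computed as in the following sketch. The sample values are made up, and the nearest-rank definition of a percentile is an assumption, since the text does not say which definition was used.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds with a heavy tail.
response_ms = [120, 135, 150, 160, 180, 210, 260, 400, 900, 4100]

median_ms = statistics.median(response_ms)
p99_ms = percentile(response_ms, 99)
print(median_ms, p99_ms)
```

Reporting both values matters for heavy-tailed workloads: the median describes the typical interaction, while the 99th percentile captures the rare slow responses that users notice.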

4  Effects  of  Cloud  Location 

In  reality,  “the  cloud”  is  an  abstraction  that  maps  to  services  in  sparsely  scattered  data  centers  across  the  globe.  As 
a  user  travels,  his  mobile  device  experiences  high  variability  in  the  end-to-end  network  latency  and  bandwidth  to 
these  data  centers.  We  examine  the  significance  of  this  variability  for  mobile  multimedia  applications.  Response 
time  for  remote  operations  is  our  primary  metric.  Energy  consumed  on  the  mobile  device  is  a  secondary  metric. 
Application-specific  metrics  such  as  frame  rate  are  also  relevant. 

4.1  Variable  Network  Quality  to  the  Cloud 

In  this  paper,  we  focus  on  Amazon  EC2  services  provided  by  several  data  centers  worldwide.  We  use  the  labels 
“East,”  “West,”  “EU,”  and  “Asia”  to  refer  to  the  data  centers  located  in  Virginia,  Oregon,  Ireland  and  Singapore. 
We  measured  end-to-end  latency  and  bandwidth  to  these  data  centers  from  a  WiFi-connected  mobile  device  located 
on  our  campuses  in  Pittsburgh,  PA  and  Lancaster,  UK.  We  also  repeated  these  measurements  from  off-campus  sites 
with excellent last-mile connectivity in these two cities. Figures 5 and 6 present our measurements, and quantify our
intuition  that  a  traveling  user  will  experience  highly  variable  cloud  connectivity.  There  are  also  some  surprises. 

One  surprise  is  the  amazingly  good  connectivity  to  EC2  East  from  our  Pittsburgh,  PA  campus.  From  a  wired 
connection,  we  measured  8  ms  ping  times  and  200  Mbps  transfer  rates  to  this  site.  Such  numbers  are  more  typical 
of LAN connections than WAN transfers! We believe that this is due to particularly favorable network routing
between our campus and the EC2 East site. This hypothesis is confirmed by the poorer off-campus measurements
shown in Figure 5. Thus, our EC2 East on-campus results best serve to indicate what one can expect from a LAN-connected
private cloud. Li et al. [24] report that average round trip time (RTT) from 260 global vantage points to
their optimal Amazon EC2 instances is 73.68 ms. Therefore, the EC2 West numbers in Figure 5 are more typical
of cloud connectivity.

Figure 5: Measured Network Quality to Amazon EC2 Sites from Carnegie Mellon University (Pittsburgh, PA)
(“Ideal” is at speed of light)

Figure 6: Measured Network Quality to Amazon EC2 Sites from Lancaster University (Lancaster, UK)
(“Ideal” is at speed of light)
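The “Ideal” latency in Figures 5 and 6 is described only as being at the speed of light. One plausible way to obtain such a bound is sketched below: take the great-circle distance between the two endpoints and divide the round-trip distance by the speed of light in vacuum. The coordinates are approximate city locations chosen for illustration (Dublin stands in for the Ireland site), not the actual data center locations, and the exact method behind the figures is not stated in this section.

```python
import math

EARTH_RADIUS_KM = 6371.0           # mean Earth radius
SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum; signals in fiber are slower


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ideal_rtt_ms(distance_km):
    """Lower bound on round-trip time: out and back at the speed of light."""
    return 2 * distance_km / SPEED_OF_LIGHT_KM_S * 1000


PITTSBURGH = (40.44, -79.99)  # approximate, for illustration
DUBLIN = (53.35, -6.26)       # approximate stand-in for the Ireland site

distance_km = great_circle_km(*PITTSBURGH, *DUBLIN)
print(round(distance_km), "km,", round(ideal_rtt_ms(distance_km), 1), "ms")
```

Any measured RTT above this bound reflects routing detours, queuing, and slower propagation in real media, so the gap between measured and ideal values indicates how much of the latency is not due to distance alone.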

Another  surprise  is  the  great  range  of  bandwidths  observed,  particularly  the  upload/download  asymmetry  and 
the  significant  variation  between  experiments.  To  mitigate  this  time-varying  factor,  we  scheduled  our  experiments 
on  weekday  nights  when  conditions  were  stable  and  bandwidth  consistently  high.  All  experiments  in  the  rest  of  the 
paper  were  run  under  these  conditions  on  campus  in  Pittsburgh. 

4.2  Impact  on  Response  Time 

We  next  evaluate  how  cloud  connectivity  affects  the  applications  described  in  Section  2.  We  consider  six  cases.  The 
first,  labeled  “Mobile,”  runs  the  application  entirely  on  the  mobile  device.  Cloud  connectivity  is  irrelevant,  but  the 
resource  constraints  of  the  mobile  device  dominate.  In  cases  2-5,  the  mobile  device  performs  the  resource-intensive 
part  of  each  operation  on  one  of  the  four  Amazon  data  centers  and  blocks  until  it  receives  the  result. 

The sixth case, labeled “1WiFi,” corresponds to the theoretical best-case for data center location. With today’s
deployed wireless technology, this is exactly one WiFi hop away from a mobile device. This can only be approximated
today in special situations: e.g., on a WiFi-covered campus, with access points connected to a private data
center through a lightly-loaded gigabit LAN backbone. If naively implemented at global scale, 1WiFi would lead
to a proliferation of data centers. Section 5 discusses how the consolidation benefits of cloud computing can be
preserved while scaling out the 1WiFi configuration.
Figure 7: Platform specifications


Figure 7 compares the characteristics of the compute platforms used in our configurations. For 1WiFi, we
create a minimal data center using a six-year old WiFi-connected server. The choice of this near-obsolete machine
is deliberate. By comparing it against a fast mobile device and fast EC2 cloud instances, we have deliberately
stacked the deck against 1WiFi. Hence, any wins by this strategy in our experiments are quite meaningful.

FACE: Figure 8 summarizes the response times measured for Face under different conditions. Here, we test
with 300 images that may have known faces, unknown faces, or no faces at all. Processing on the mobile device
alone can provide tolerable response times for the easier images, but is crushed by the heavy-tailed distribution of
processing costs. Only 1WiFi can provide fast response (<200 ms) most of the time, and a tolerable worst case
response time. Hence, 1WiFi is the best approach to running Face.
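Statements such as “fast response most of the time” are read off the cumulative distribution function (CDF) of response times. A minimal sketch of evaluating an empirical CDF at a threshold follows; the sample values are made up and are not the measurements behind Figure 8.

```python
def fraction_within(samples_ms, threshold_ms):
    """Empirical CDF at `threshold_ms`: share of samples at or below the threshold."""
    return sum(1 for s in samples_ms if s <= threshold_ms) / len(samples_ms)


# Hypothetical response times in milliseconds.
samples_ms = [90, 110, 140, 150, 170, 190, 230, 480, 1200, 3900]
print(fraction_within(samples_ms, 200))
```

A strategy whose CDF reaches a high fraction at a small threshold and has a short tail is preferable for interactive use.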

SPEECH: Results for Speech are somewhat different (Figure 9). Here, the application generally requires significant
processing for each query, and data transfer costs are modest. This changes the relative performance of
the strategies significantly. As the response time is dominated by processing time, this favors the more capable but
distant servers in the cloud over the weak 1WiFi server. Processing without cloud assistance is out of the question.
For Speech, using the closest EC2 data center is the winning strategy. To understand the effect of a more powerful
1WiFi machine, we repeated that experiment with an Intel i7-3770 desktop. The results shown in Figure 14 confirm
that 1WiFi now dominates the alternatives.

OBJECT: Compared to the previous two applications, Object requires significantly greater compute resources.
Unfortunately, the processing load is so large that none of the approaches yield acceptable interactive response
times (Figure 10). This application really needs more resources than our single VM instances or weak 1WiFi
server can provide. To bring response times down to reasonable levels for interactive use, we will either need to
parallelize the application beyond a single machine/VM boundary and employ a processing cluster, or make use of
GPU hardware to accelerate critical routines. Both of these potential solutions are beyond the scope of this paper.
Using the faster 1WiFi machine (Intel i7-3770) does help significantly (Figure 15).

AugReal: This application employs a low-cost feature extraction algorithm, and an efficient approximate
nearest-neighbor algorithm to match features in its database. While these processing costs are modest, data transfer
costs are high because of image transmission. Therefore, as shown in Figure 11, none of the EC2 cases is adequate
for this application. They generally provide slower response times than execution on the mobile device. 1WiFi,
on the other hand, works extremely well for this application, providing very fast response times (around 100 ms)
needed for crisp interactions. This is clearly the winning strategy for AugReal.
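The contrast between Speech and AugReal can be made concrete with a first-order model in which the response time of a blocking remote operation is the sum of network round-trip time, data transfer time, and server compute time. The model and every number below are illustrative assumptions, not measurements; it ignores effects such as TCP slow start, upload/download asymmetry, and jitter.

```python
def response_time_ms(rtt_ms, payload_kb, bandwidth_mbps, compute_ms):
    """First-order estimate: round trip + transfer + server compute.

    payload_kb * 8 gives kilobits, and 1 Mbps moves 1 kilobit per millisecond,
    so the quotient is already in milliseconds.
    """
    transfer_ms = payload_kb * 8 / bandwidth_mbps
    return rtt_ms + transfer_ms + compute_ms


# Transfer-heavy request (large image, light processing), hypothetical numbers.
nearby_weak = response_time_ms(rtt_ms=5, payload_kb=100, bandwidth_mbps=20, compute_ms=60)
distant_fast = response_time_ms(rtt_ms=80, payload_kb=100, bandwidth_mbps=4, compute_ms=20)
print(nearby_weak, distant_fast)

# Compute-heavy request (small audio clip, heavy processing), hypothetical numbers.
nearby_weak_cpu = response_time_ms(rtt_ms=5, payload_kb=10, bandwidth_mbps=20, compute_ms=1500)
distant_fast_cpu = response_time_ms(rtt_ms=80, payload_kb=10, bandwidth_mbps=4, compute_ms=400)
print(nearby_weak_cpu, distant_fast_cpu)
```

Under these assumptions the nearby but slower server wins when transfer dominates, and the distant but faster server wins when computation dominates, which mirrors the qualitative pattern reported for AugReal and Speech.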

FLUID:  Response  time  for  Fluid  is  defined  as  the  time  between  the  sensing  of  a  user  action  (i.e.,  accelerometer 
reading),  to  when  that  input  is  reflected  in  the  output.  This  largely  reflects  three  factors:  the  execution  time  of 
a  simulation  step,  network  latency,  and  data  transfer  time  for  a  frame  from  the  simulation  thread.  As  seen  in 
Figure 12, local execution on the mobile device produces good response times, since all but the first factor are
essentially zero. However, simulation speed and frame rate also need to be considered (Figure 13). The simulation
runs asynchronously to the inputs and display, and tries to match simulated time with wall-clock time. Since the
mobile device cannot execute the simulation steps fast enough, fluid motions are less than one fifth of realistic
speeds. The cloud strategies do not have this issue, but due to bandwidth and network latencies, cannot deliver
the results of the simulation fast enough to sustain the full frame rate. Only 1WiFi and East can deliver both good
responsiveness and high frame rates.

Figure 8: Face: Cumulative distribution function (CDF) of response times in ms (300 images).

Figure 9: Speech: CDF of response times in ms (500 WAV files, each recording one sentence).

Figure 10: Object: CDF of response times in ms (300 images).

Figure 11: AugReal: CDF of response times in ms (100 images).

Figure 12: Fluid: CDF of response times in ms (10 min.
(see also Figure 13)

Figure 13: Simulation speed, frame rate for Fluid.

Figure 14: Experiments of Figure 9 repeated with faster 1WiFi machine

Figure 15: Experiments of Figure 10 repeated with faster 1WiFi machine
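The simulation-speed argument for Fluid can be expressed as a short calculation. With a fixed timestep, the simulation keeps up with wall-clock time only if each step executes in no more than one timestep; otherwise simulated time advances at the ratio of the two. The 20 ms timestep is from Section 2.5; the per-step execution times below are hypothetical.

```python
def max_frame_rate(timestep_ms):
    """Frames per second if every simulation step produces one frame in real time."""
    return 1000 / timestep_ms


def simulation_speed(timestep_ms, step_exec_ms):
    """Ratio of simulated time to wall-clock time, capped at 1.0 (real time)."""
    return min(1.0, timestep_ms / step_exec_ms)


print(max_frame_rate(20))         # 20 ms timesteps correspond to 50 frames per second
print(simulation_speed(20, 110))  # a hypothetical 110 ms step runs below one fifth of real time
print(simulation_speed(20, 10))   # a hypothetical 10 ms step keeps up with real time
```

This is why local execution can show excellent response time yet still deliver an unrealistic, slow-motion simulation.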


4.3  Impact  on  Energy  Usage 

Battery life is a key attribute of a mobile device. Executing resource-intensive operations in the cloud can greatly
reduce  the  energy  consumed  on  the  mobile  device  by  the  processor(s),  memory  and  storage.  However,  it  increases 
network  use  and  wireless  energy  consumption.  Since  peak  processor  power  consumption  exceeds  wireless  power 
consumption  on  today’s  high-end  mobile  devices,  this  tradeoff  favors  cloud  processing  as  computational  demands 
increase.  Network  latency  has  recently  been  shown  to  increase  energy  consumption  for  remote  execution  by  as 
much  as  50%,  even  if  bandwidth  and  computation  are  held  constant  [25,  15].  This  is  because  hardware  elements  of 
the  mobile  device  remain  in  higher-power  states  for  longer  periods  of  time. 

Figure  16  summarizes  energy  consumption  on  our  mobile  device  for  the  experiments  described  in  Section  4.2. 
For  each  application,  the  first  row  shows  the  power  dissipation  in  watts,  averaged  over  the  whole  experiment.  In 
all  cases,  this  quantity  shows  little  variation  across  data  centers.  Local  execution  on  the  mobile  device  incurs  the 
highest power dissipation. Note that the netbook platform has a high baseline idle power dissipation (around 10W),
so the relative improvement in power is likely to be larger on more energy-efficient hardware.

Figure 16: Energy consumption on mobile device

Average power dissipation only tells part of the story. Cloud use also tends to shorten the time to obtain a
result. When this is factored in, the energy consumed per query or frame is dramatically improved. These results
are shown in the second row for each application in Figure 16. In the best case, the energy consumed per result is
reduced  by  a  factor  of  3  to  6.  The  strategies  that  exhibit  the  greatest  energy  efficiency  are  also  the  ones  that  give 
the  best  response  times. 
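
The relationship between the two rows of Figure 16 can be illustrated with a small calculation: energy per result is the average power multiplied by the time needed to produce that result. The sketch below uses invented numbers, not the measurements in Figure 16, and the function name is hypothetical. It only shows how a modest drop in power combined with a shorter completion time compounds into a large drop in energy per result.

```python
def energy_per_result(avg_power_watts: float, seconds_per_result: float) -> float:
    """Energy in joules consumed to produce one result (query or frame)."""
    return avg_power_watts * seconds_per_result


# Hypothetical values chosen for illustration only; they are NOT taken
# from the experiments reported in this paper.
local_joules = energy_per_result(avg_power_watts=20.0, seconds_per_result=3.0)
cloud_joules = energy_per_result(avg_power_watts=15.0, seconds_per_result=1.0)

# Ratio of local to cloud energy per result.
improvement = local_joules / cloud_joules
```

With these assumed inputs, a 25% reduction in average power and a threefold reduction in time per result combine into a fourfold reduction in energy per result.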

4.4  Summary  and  Discussion 

The results of Sections 4.2 and 4.3 confirm that logical proximity to the data center is essential for mobile applications
that are highly interactive and resource-intensive. By “logical proximity” we mean the end-to-end properties of
high  bandwidth,  low  latency  and  low  jitter.  Physical  proximity  is  only  weakly  correlated  with  logical  proximity 
because  of  the  well-known  “last  mile”  problem  [26]. 

1WiFi represents the best attainable logical proximity. Our results show that this extreme case is indeed valuable
for  many  of  the  applications  studied,  both  in  terms  of  response  time  and  energy  efficiency.  It  is  important  to  keep  in 
mind  that  these  are  representative  of  a  new  genre  of  cognitive  assistance  applications  that  are  inspired  by  the  sensing 
and  user  interaction  capabilities  of  mobile  devices.  Mobile  participation  in  server-based  multiplayer  games  such  as 
Doom  3  is  another  use  case  that  can  benefit  from  logical  proximity  [27].  The  emergence  of  such  applications  can  be 
accelerated  by  deploying  infrastructure  that  assures  mobile  users  of  continuous  logical  proximity  to  the  cloud.  The 
situation  is  analogous  to  the  dawn  of  personal  computing,  when  the  dramatic  lowering  of  user  interaction  latency 
relative  to  time-sharing  led  to  entirely  new  application  metaphors  such  as  spreadsheets  and  WYSIWYG  editors. 


5 Scaling Out 1WiFi

We thus face contradictory requirements. On the one hand, the 1WiFi property is valuable for mobile computing.
On the other hand, it works against consolidation because there have to be many data centers at the edges of
the Internet to ensure 1WiFi cloud access everywhere. Consolidation is the essence of cloud computing because
dispersion  induces  diseconomies  of  scale:  the  marginal  cost  of  administering  machines  in  a  centralized  data  center 
is  typically  lower  than  when  they  are  spread  over  smaller  data  centers.  How  can  we  resolve  this  tension? 

5.1  Concept 

We  assert  that  the  only  practical  solution  to  this  problem  is  a  hierarchical  organization  of  data  centers,  as  shown  in 
Figure 17. Level 1 of this hierarchy is today’s unmodified cloud infrastructure such as Amazon’s EC2 data centers.
Level 2 consists of stateless data centers at the edges of the Internet, servicing currently-associated mobile devices.

Figure 17: Two-level Hierarchical Cloud Architecture

We  envision  an  appliance-like  deployment  model  for  Level  2  data  centers.  They  are  not  actively  managed  after 
installation.  Instead,  soft  state  (such  as  virtual  machine  images  and  files  from  a  distributed  file  system)  is  cached 
on  their  local  storage  from  one  or  more  Level  1  data  centers.  It  is  the  absence  of  hard  state  at  Level  2  that  keeps 
management overhead low. Consolidation or reconfiguration of Level 1 data centers does not affect the 1WiFi
property  at  Level  2.  Adding  a  new  Level  2  data  center  or  replacing  an  existing  one  only  requires  modest  setup  and 
configuration.  Once  configured,  a  Level  2  data  center  can  dynamically  self-provision  from  Level  1  data  centers. 
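
The idea that a Level 2 data center holds only soft state can be sketched as a bounded cache that fetches from Level 1 on a miss. This is a minimal illustration, not the mechanism used in any particular system: the class name, the `fetch_from_level1` callback, and the least-recently-used eviction policy are all assumptions made for the example. The point is that any cached item can be discarded at any time, because Level 1 still holds the authoritative copy.

```python
from collections import OrderedDict
from typing import Callable


class SoftStateCache:
    """Bounded cache of items (e.g. VM images) whose master copy lives at Level 1."""

    def __init__(self, fetch_from_level1: Callable[[str], bytes], capacity: int):
        self._fetch = fetch_from_level1   # hypothetical callback to a Level 1 data center
        self._capacity = capacity         # maximum number of cached items
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, name: str) -> bytes:
        if name in self._items:
            # Cache hit: mark the item as most recently used.
            self._items.move_to_end(name)
            return self._items[name]
        # Cache miss: self-provision from Level 1.
        data = self._fetch(name)
        self._items[name] = data
        if len(self._items) > self._capacity:
            # Evicting is always safe because nothing here is hard state.
            self._items.popitem(last=False)
        return data
```

Replacing a Level 2 data center then amounts to starting with an empty cache, which refills on demand.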

Physical  motion  of  a  mobile  device  may  take  it  far  from  the  Level  2  data  center  with  which  it  is  currently 
associated. Beyond a certain distance, the 1WiFi property may no longer hold. In that case, a mechanism similar
to  wireless  access  point  handoff  can  be  executed  to  seamlessly  switch  association  to  a  different  Level  2  data  center. 
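
One possible shape for such a handoff decision is sketched below. The paper does not specify the mechanism, so everything here is an assumption: the function name is hypothetical, measured round-trip time stands in for logical proximity, and the switching margin is a hysteresis value of the kind commonly used to keep a client from flapping between access points of similar quality.

```python
from typing import Dict, Optional


def choose_level2(current: Optional[str],
                  rtt_ms: Dict[str, float],
                  margin_ms: float = 5.0) -> Optional[str]:
    """Return the Level 2 data center a mobile device should associate with.

    `rtt_ms` maps each reachable Level 2 data center to its measured
    round-trip time. The device leaves `current` only when another
    candidate is better by more than `margin_ms`.
    """
    if not rtt_ms:
        return None                      # nothing reachable
    best = min(rtt_ms, key=rtt_ms.get)
    if current is None or current not in rtt_ms:
        return best                      # not associated, or current is out of range
    if rtt_ms[current] - rtt_ms[best] > margin_ms:
        return best                      # clearly better candidate: hand off
    return current                       # stay put to avoid flapping
```

For example, `choose_level2("a", {"a": 10.0, "b": 8.0})` keeps the device on `"a"`, while `choose_level2("a", {"a": 40.0, "b": 8.0})` switches it to `"b"`.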


(a)  Outdoor  (b)  Solar  Powered  (c)  Indoor 

Figure  18:  Unattended  Micro  Data  Centers  (Sources:  [28,  29]) 

5.2 Physical Realization

The  hardware  technology  for  Level  2  data  centers  is  already  here  today  for  reasons  unrelated  to  mobile  computing. 
For  example,  Myoonet  has  pioneered  the  concept  of  micro  data  centers  for  use  in  developing  countries  (Figure  18(a) 
and  (b))  [28].  AOL  has  recently  introduced  indoor  micro-data  centers  for  enterprises  (Figure  18(c))  [29].  Today, 
these micro data centers are being used as Level 1 data centers in private clouds. By removing hard state and adding
self-provisioning, they can be repurposed as Level 2 data centers. In the future, one can envision optimized Level
2 data centers for 1WiFi. For example, with modest engineering effort, a WiFi access point could be transformed
into a “nano,” “pico,” or “femto” Level 2 data center by adding processing, memory and storage.

While  much  innovation  and  evolution  will  undoubtedly  occur  in  the  form  factors  and  configurations  of  future 
Level  2  data  centers,  we  can  identify  four  key  attributes  that  any  such  implementation  must  possess: 

•  Only  soft  state:  It  does  not  have  any  hard  state,  but  only  cached  state  from  Level  1.  It  may  also  buffer  data 
from  a  mobile  device  en  route  to  Level  1. 

•  Powerful,  well-connected  and  safe:  It  is  powerful  enough  to  handle  resource-intensive  applications  from 
multiple  associated  mobile  devices.  Bandwidth  between  Level  1  and  Level  2  is  good,  typical  WAN  latency 
is  acceptable,  and  network  failures  are  rare.  Battery  life  is  not  a  concern.  It  is  a  trusted  computing  platform. 

•  Close  at  hand:  It  is  easily  deployable  within  one  Wi-Fi  hop  of  associated  mobile  devices. 

•  Builds  on  standard  cloud  technology:  It  leverages  and  reuses  Level  1  software  infrastructure  and  standards 
(e.g.  OpenStack)  as  much  as  possible. 

5.3  Operating  Environment 

There  is  significant  overlap  in  the  requirements  specifications  for  Levels  1  and  2.  At  both  levels,  there  is  the  need 
for:  (a)  strong  isolation  between  untrusted  user-level  computations;  (b)  mechanisms  for  authentication,  access 
control,  and  metering;  (c)  dynamic  resource  allocation  for  user-level  computations;  and,  (d)  the  ability  to  support 
a  very  wide  range  of  user-level  computations,  with  minimal  restrictions  on  their  process  structure,  programming 
languages or operating systems. At Level 1, these requirements are met today using the virtual machine (VM)
abstraction. For precisely the same reasons they are so valuable at Level 1, we foresee VMs as central to Level 2.

A  rich  ecosystem  of  VM-based  mechanisms,  policies  and  practices  already  exists  for  Level  1,  but  some  changes 
may  be  needed  for  Level  2.  For  example,  managing  cooling  and  power  are  major  concerns  at  Level  1  but  are  less 
important  at  Level  2  because  data  centers  are  much  smaller  and  ease  of  deployment  is  the  dominant  concern. 

Trust is a differentiator between the two levels. A Level 1 data center is effectively a small fort, with careful
attention paid to physical security of the perimeter. Tampering with hardware within Level 1 is assumed to be
impossible. Mechanisms such as TPM-based attestation are therefore not often used at this level. In contrast, a
Level  2  data  center  has  weak  perimeter  security  even  if  it  is  located  in  a  locked  closet  or  above  the  ceiling.  Hence, 
tamper-resistant  and  tamper-evident  enclosures,  remote  surveillance,  and  TPM-based  attestation  will  all  be  more 
important  at  Level  2. 

The  speed  of  provisioning  is  another  major  differentiator  between  Level  1  and  Level  2.  Today,  Level  1  data 
centers are optimized for launching VM images that already exist in their storage tier. They do not provide fast
options for instantiating a new custom image. One must either launch an existing image and laboriously modify it,
or suffer the long, tedious upload of the custom image over a WAN. In contrast, Level 2 data centers need to be
much  more  agile  in  their  provisioning.  Their  association  with  mobile  devices  is  highly  dynamic,  with  considerable 
churn  due  to  user  mobility.  A  user  from  far  away  may  unexpectedly  show  up  at  a  Level  2  data  center  (e.g.,  if  he  just 
got  off  an  international  flight)  and  try  to  use  it  for  an  application  such  as  a  personalized  language  translator.  For 
that  user,  the  provisioning  delay  before  he  is  able  to  use  the  application  impacts  usability. 

We see at least three different approaches to rapid provisioning at Level 2. One approach is to exploit higher-level
hints of user mobility (e.g., derived from online schedules, travel information, real-time tracking, explicit user
guidance,  etc.)  to  pre-provision  Level  2  data  centers.  A  second  approach  is  to  launch  a  VM  instance  at  Level 
2  without  provisioning  delay,  and  then  demand  page  the  VM  state  as  execution  proceeds.  This  reduces  startup 
delay  at  the  cost  of  unpredictable  delays  during  execution.  The  feasibility  of  this  approach  has  been  shown  in  the 
Internet  Suspend/Resume  system  [30]  and  other  similar  systems.  A  third  approach  is  to  synthesize  the  desired  VM 
state from a pre-cached base VM and a relatively small dynamically-transmitted overlay [31, 32]. Exploring the
tradeoffs  and  optimizations  in  this  space  is  important,  but  the  feasibility  of  dynamic  provisioning  is  not  in  doubt. 
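
The third approach can be illustrated with a deliberately simplified sketch. It treats a VM image as a byte string and represents the overlay as the set of fixed-size blocks that differ from the base. This block-level scheme, the function names, and the tiny block size are assumptions made for illustration; they are not the overlay format of the systems cited in [31, 32].

```python
from typing import Dict

BLOCK_SIZE = 4  # tiny block size for illustration; real images would use far larger blocks


def make_overlay(base: bytes, target: bytes) -> Dict[int, bytes]:
    """Record only the blocks of `target` that differ from `base`."""
    overlay = {}
    for offset in range(0, len(target), BLOCK_SIZE):
        block = target[offset:offset + BLOCK_SIZE]
        if block != base[offset:offset + BLOCK_SIZE]:
            overlay[offset] = block
    return overlay


def apply_overlay(base: bytes, overlay: Dict[int, bytes], target_len: int) -> bytes:
    """Rebuild the target image from the pre-cached base and the overlay."""
    # Start from the base, truncated or zero-padded to the target length.
    image = bytearray(base[:target_len].ljust(target_len, b"\0"))
    for offset, block in overlay.items():
        image[offset:offset + len(block)] = block
    return bytes(image)


base_image = b"AAAABBBBCCCC"       # stands in for the pre-cached base VM
custom_image = b"AAAAXXXXCCCCDD"   # stands in for the user's customized VM
overlay = make_overlay(base_image, custom_image)
rebuilt = apply_overlay(base_image, overlay, len(custom_image))
```

Only the overlay needs to cross the network at association time. In this toy case it holds two blocks, while the unchanged blocks come from the copy already cached at Level 2.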

Unique  to  Level  2  is  the  problem  of  dynamic  discovery  by  mobile  clients,  as  a  precursor  to  association.  One 
approach is manual selection, using a mechanism similar to what is already in use today for choosing WiFi networks
based on their SSIDs. More sophisticated solutions could also be built, leveraging existing low-level service
discovery  mechanisms  such  as  UPnP,  Bluetooth  Service  Discovery,  Avahi,  and  Jini. 

6  When  Level  1  is  Unreachable 

The hierarchical organization of Figure 17 was derived solely from considerations of performance and consolidation.
As a bonus, it also improves availability. Once a Level 2 data center has been provisioned for an associated
mobile device, WAN network failures or Level 1 data center failures are no longer disruptive. This achieves
disconnected operation, a concept originally developed for distributed file systems [33]. Simanta et al. [34, 35] explore
the tradeoffs between performance and availability in dynamically provisioning Level 2 from Level 1.

The  improved  availability  of  the  two-level  architecture  applies  even  to  mobile  applications  that  are  not  latency- 
sensitive.  Any  mobile  application  that  uses  the  cloud  for  remote  execution  can  benefit.  Although  not  widely 
discussed  today,  the  economic  advantages  of  data  center  consolidation  come  at  the  cost  of  reduced  autonomy  and 
vulnerability  to  cloud  failure.  These  are  not  hypothetical  worries,  as  shown  by  the  day-long  outage  of  Siri  in 
2011 [36, 37] and the multi-hour outage of Amazon’s data center in Virginia in June 2012 [38]. As users become
reliant  on  mobile  multimedia  applications,  they  will  face  inconvenience  and  frustration  when  a  cloud  service  for  a 
critical application is unavailable. These concerns are especially significant in domains such as military operations
and  disaster  recovery. 

7  Related  Work 

This is the first work to rigorously explore the impact of mobile multimedia applications on data center consolidation.
The concept of “cyber foraging” by mobile devices (i.e., leveraging nearby resources) was first articulated
in  2001  [39].  Flinn  [15]  describes  the  extensive  work  on  this  topic  since  then.  A  2009  position  paper  [32]  argued 
that  end-to-end  latency  was  the  critical  determinant  of  whether  public  clouds  were  adequate  for  deeply  immersive 
applications. It introduced the concept of “cloudlets,” which correspond to Level 2 data centers in this paper. However,
that work offered no experimental evidence to support its claim. While cloudlets were shown to be sufficient
by construction, they were not shown to be necessary. One way to view this paper is that it provides the empirical
evidence that cloudlets are not a luxury, but indeed a necessity in the face of real-world network connectivity to
public  cloud  infrastructure. 

Recent  work  by  others  corroborates  the  conclusions  of  this  paper.  Clinch  et  al.  [40]  explore  the  need  for  logical 
proximity when using a large static display from a mobile device. Their results are consistent with the findings
reported  here.  Soyata  et  al.  [41]  use  Monte  Carlo  simulation  to  explore  how  a  face  recognition  algorithm  should  be 
partitioned  across  multiple  back-end  computation  engines.  They  conclude  that  a  cloudlet-based  strategy  is  optimal, 
and  provide  experimental  validation. 

8 Conclusion


The  convergence  of  mobile  computing  and  cloud  computing  enables  new  multimedia  applications  that  are  both 
resource-intensive  and  interaction-intensive.  For  these  applications,  end-to-end  network  bandwidth  and  latency 
matter  greatly  when  cloud  resources  are  used  to  augment  the  computational  power  and  battery  life  of  a  mobile 
device. In this paper, we have presented quantitative evidence that latency considerations limit data center consolidation.
The need to guard against inaccessibility of a distant data center also limits consolidation. We have shown
how  these  challenges  can  be  addressed  by  a  two-level  hierarchical  structure  that  seamlessly  extends  today’s  cloud 
computing  infrastructure. 